RapidMiner Studio documentation, version 10.0
Cut Document (Text Processing)
Synopsis
Cuts an input document into segments using regular expressions specifying the start and end of each segment.
Description
This operator segments a text based on a starting and ending regular expression.
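The segmentation idea can be illustrated outside RapidMiner with a minimal Python sketch. This is not the operator's implementation: the function name `cut_document` is hypothetical, and the sketch assumes that each segment is the text between a start match and the next end match, with the matches themselves excluded.

```python
import re

def cut_document(text, start_pattern, end_pattern):
    """Return the text between each start match and the following end match.

    Assumption: the matched delimiters are not part of the segment.
    """
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    segments = []
    pos = 0
    while pos <= len(text):
        start = start_re.search(text, pos)
        if start is None:
            break
        end = end_re.search(text, start.end())
        if end is None:
            break  # an unterminated segment is dropped in this sketch
        segments.append(text[start.end():end.start()])
        # Always move forward, even if both patterns matched empty strings.
        pos = max(end.end(), pos + 1)
    return segments

segments = cut_document("<p>one</p><p>two</p>", r"<p>", r"</p>")
print(segments)  # ['one', 'two']
```

Each returned string corresponds to one element of the output collection.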
Input
- document
Output
- documents (Collection)
Collection of the segments cut from the input document.
Parameters
- query type Specifies the type of the query. The available query types are: String Matching, Regular Expression, Regular Region, Indexed, XPath and JSONPath; Range: selection
- string matching queries Specifies a list of string matching start and end sequences. Everything between the two sequences is used as the result. See the operator documentation for details on string matching. Range: list
- attribute type Specifies the type of the resulting attributes. If Numerical or Binominal is chosen, make sure the returned result can be interpreted as that type. The available types are: Nominal, Numerical and Binominal; Range: selection
- regular expression queries Specifies a list of attribute names and their corresponding regular expressions. The first matching group is used as value. See the operator documentation for details on regular expressions. Range: list
- regular region queries Specifies a list of attribute names and their corresponding regular expressions. Two regular expressions can be specified to define the start and the end of a region. Everything between the two matches is returned as the result. Range: list
- xpath queries Specifies a list of attribute names and their corresponding XPath queries. See the operator documentation for details on XPath. Range: list
- namespaces Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h. Range: list
- ignore CDATA Indicates whether CDATA should be ignored when evaluating XPath expressions. Range: boolean
- assume html If checked, a more tolerant XML parser is used, which copes with forbidden HTML constructions but always assumes HTML and adds missing tags. For plain XML, uncheck this. Range: boolean
- index queries Specifies a list of attribute names and their corresponding regions. Each region is specified by the offset index and the length of the match. Range: list
- jsonpath queries Specifies a list of attribute names and their corresponding JSONPath queries. Range: list
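
As a rough illustration of how several of the query types above behave, the following Python sketch mimics string matching, regular expression, index and XPath queries on small inputs. All function names are hypothetical, and the behavior is an assumption based on the parameter descriptions, not the operator's code. The XPath part uses the limited XPath subset of Python's `xml.etree.ElementTree` and binds the XHTML namespace to the identifier `h` by hand.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: mirrors the automatic binding of the (x)html namespace to "h".
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

def string_matching_query(text, start, end):
    """Text between the first literal `start` and the next literal `end`."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None

def regular_expression_query(text, pattern):
    """Value of the first matching group, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None or match.re.groups < 1:
        return None
    return match.group(1)

def index_query(text, offset, length):
    """Region given by an offset index and a length."""
    return text[offset:offset + length]

def xpath_query(xml_text, path, namespaces=XHTML_NS):
    """Text of the first element matching `path`, or None."""
    node = ET.fromstring(xml_text).find(path, namespaces)
    return node.text if node is not None else None

doc = "Order 4711 shipped to [Berlin]"
city = string_matching_query(doc, "[", "]")
order_id = regular_expression_query(doc, r"Order (\d+)")
region = index_query(doc, 6, 4)

xhtml = ('<html xmlns="http://www.w3.org/1999/xhtml">'
         '<head><title>Report</title></head></html>')
title = xpath_query(xhtml, ".//h:title")

print(city, order_id, region, title)  # Berlin 4711 4711 Report
```

With the attribute type set to Numerical, a value such as `"4711"` would additionally need to be parseable as a number, for example with `int(order_id)` in this sketch.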